Efficiencies for Working with IPEDS Peer Data

Alex McClung

Data Resource Analyst

April 3, 2023

Outline

  1. Automating custom IPEDS Data Center downloads - with documentation!

  2. Improving the speed and efficiency of working with multiple years of IPEDS “Complete” data files

Motivation

Peer data needed for 14 Yale school profiles

IPEDS Data Center Downloads

Using codegen from Playwright for Python 🎭 to record browser interactions as runnable code
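With Playwright for Python installed, codegen can be launched from the command line; the flags, output file name, and browser choice below are illustrative, not from the talk:

```shell
# One-time setup: install Playwright for Python and a browser binary
pip install playwright
playwright install chromium

# Open a browser window; every click and keystroke is transcribed
# into a Python script you can copy and modify
playwright codegen --target python -o ipeds_download.py \
    "https://nces.ed.gov/ipeds/use-the-data"
```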

Playwright for Python 🎭

Copy, modify, and then run the code from codegen

from playwright.sync_api import Playwright, sync_playwright

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False, slow_mo=50)
    ### RECORD VIDEO
    context = browser.new_context(record_video_dir="videos/")
    page = context.new_page()
    page.goto("https://nces.ed.gov/ipeds/use-the-data")
    page.get_by_role("link", name="Compare Institutions", exact=True).click()
    ### ADD/CHANGE PEER SETS
    page.locator("#tbInstitutionSearch").fill("217156,190150,190415,182670,166027,166683,186131,243744,144050,215062,130794")
    page.get_by_role("button", name="Select").click()
    page.get_by_text("Check All", exact=True).click()
    page.get_by_role("button", name="Continue").click()
    page.get_by_role("link", name="Continue").click()
    page.get_by_text("Admissions and Test Scores").click()
    page.get_by_text("Admissions and test scores", exact=True).click()
    page.get_by_text("Number of applications, admissions, and enrollees").click()
    ### ADD/CHANGE YEARS
    page.get_by_label("2021-22").check()
    page.get_by_label("Applicants total").check()
    page.get_by_label("Admissions total").check()
    page.get_by_label("Enrolled total").check()
    ### SAVE SCREENSHOTS
    page.screenshot(path="admissions-screenshot.png")
    page.get_by_role("img", name="Continue").click()
    page.get_by_role("img", name="Continue").click()
    with page.expect_download() as download_info:
        page.get_by_role("button", name="Continue").click()
    download = download_info.value
    ### EXPORT DATA AS CSV
    download.save_as("ivyplus-admissions.csv")

    # ---------------------
    context.close()
    browser.close()

with sync_playwright() as playwright:
    run(playwright)

IPEDS Complete Data Files

Working with multiple years of data with Arrow (https://arrow.apache.org/)

  • Read multiple CSVs at once (or, as one)

  • Query datasets before reading them into memory

  • Add years from filenames to the dataset

Arrow Example in R

Read multiple CSVs as a “multi-file dataset” with arrow::open_dataset()

library(arrow)
library(dplyr)   ## FOR %>% PIPES, filter(), group_by(), summarize()
library(tictoc)  ## FOR tic()/toc() TIMING
## EXAMPLE: 5 YRS OF IPEDS DATA FILES (~60 MB EACH)
dir('data')
[1] "c2017_a.csv" "c2018_a.csv" "c2019_a.csv" "c2020_a.csv" "c2021_a.csv"
## READ ALL FILES IN THE DIRECTORY AT ONCE  
tic()
ds <- open_dataset('data', format = 'csv')
toc()
0.16 sec elapsed
## WHAT KIND OF OBJECT? OBJECT SIZE? 
class(ds); print(object.size(ds), units = "auto")
[1] "FileSystemDataset" "Dataset"           "ArrowObject"      
[4] "R6"               
504 bytes

Arrow Example in R

Query data before reading it into memory

tic()
ds <- open_dataset('data', format = 'csv') %>%
  ## YOUR QUERY HERE  
  filter(UNITID == 130794) %>% 
  group_by(AWLEVEL) %>% 
  summarize(Degrees = sum(CTOTALT, na.rm = TRUE))
toc()
0.13 sec elapsed
## WHAT KIND OF OBJECT? OBJECT SIZE? 
class(ds); print(object.size(ds), units = "auto")
[1] "arrow_dplyr_query"
8.2 Kb

Arrow Example in R

Get the year with add_filename(), then collect() the data into memory

tic()
open_dataset('data', format = 'csv') %>% 
  filter(UNITID == 130794, MAJORNUM == 1, CIPCODE == 99) %>%
  ## ADD YEAR FROM FILENAME TO DATASET 
  mutate(file_name = add_filename(),
         Year = gsub('.*c([0-9]+)_a.csv', '\\1', file_name)) %>% 
  group_by(Year) %>% 
  summarise(Degrees = sum(CTOTALT, na.rm = TRUE)) %>% 
  collect() ## INTO MEMORY
# A tibble: 5 × 2
  Year  Degrees
  <chr>   <int>
1 2017     4345
2 2018     4494
3 2019     4789
4 2020     4863
5 2021     4121
toc()
6.17 sec elapsed

Additional Uses for Arrow

  • You do not need multiple files to benefit from Arrow
## READ AND QUERY BEFORE COLLECTING INTO MEMORY  
tic()
degrees_fy21 <- read_csv_arrow('data/c2021_a.csv', as_data_frame = FALSE)
toc()
0.54 sec elapsed
  • Read and write Parquet data files, widely considered the default format (“the new CSV” 😎) for cloud computing
arrow::open_dataset()
arrow::write_parquet()
arrow::read_parquet()

Thank You!